Regulating Greed Over Time
Abstract

In retail, there are predictable yet dramatic time-dependent patterns in customer behavior, such as periodic changes in the
number of visitors, or increases in visitors just before major holidays (e.g., Christmas). The current paradigm of multi-armed
bandit analysis does not take these known patterns into account, which means that despite the firm theoretical foundation of
these methods, they are fundamentally flawed when it comes to real applications. This work provides a remedy that takes
the time-dependent patterns into account, and we show how this remedy is implemented in the UCB and ε-greedy methods.
In the corrected methods, exploitation (greed) is regulated over time, so that more exploitation occurs during higher reward
periods, and more exploration occurs in periods of low reward. In order to understand why regret is reduced with the corrected
methods, we present a set of bounds that provide insight into why we would want to exploit during periods of high reward,
and discuss the impact on regret. Our proposed methods have excellent performance in experiments, and were inspired by a
high-scoring entry in the Exploration and Exploitation 3 contest using data from Yahoo! Front Page. That entry heavily used
time-series methods to regulate greed over time, which was substantially more effective than other contextual bandit methods.
Keywords: Multi-armed bandit, exploration-exploitation trade-off, time series, retail management, marketing, online
applications, regret bounds.


1 Introduction 

Consider the classic pricing problem faced by retailers, where the price of a product is chosen to maximize the expected profit. 
The optimal price is learned through a mix of exploring various pricing choices and exploiting those known to yield higher 
profits. Exploration-exploitation problems occur not only in retail (both in stores and online), but in marketing, on websites 
such as Yahoo! Front Page, where the goal is to choose which of a set of articles to display to the user, on other webpages 
where ads are shown to the user on a sidebar (e.g., Facebook, Slashdot), and even on websites like YouTube that recommend 
the next video (and related targeted ad) to watch. In all of these applications, the classic assumption of random rewards with
a static probability distribution is badly violated. For retailers, there are almost always clear trends in customer arrivals, and
they are often predictable. For instance. Figure [T^ shows clear weekly periodicity of customer arrivals, with a large dip on 
Christmas Eve and a huge peak on Boxing Day and Figure[^shows clear half-day periodicity of the number of Skype users. 
These dramatic trends might have a substantial impact on which policy we would use to show ads and price products, if we
knew how to handle them, yet these trends are completely ignored in classic multi-armed bandit analysis, where we assume
each arm has a static fixed distribution of rewards.

The time series patterns within the application areas discussed above take many different forms, but in all cases, the same 
effect is present, where regulating the rate of exploitation (i.e., greed) over time could be beneficial. For retailers, the number
of customers is much larger on certain days than others, and for these days we should exploit by choosing the best prices, 
and not explore. For Yahoo! Front Page, articles have a short lifespan and some articles are much better than others, in which 
case, if we find a particularly good article, we should exploit by repeatedly showing that one, and not explore new articles. 
For online advertising, there are certain periods where customers are more likely to make a purchase, so prices should be set 
to optimal values during those times, and we should not be exploring suboptimal choices. 

In this work, we consider a simple but effective way to model a time-dependent effect on the rewards, which is that the
static rewards for all the possible choices (the so-called arms, e.g. the ads shown on a website) are multiplied by a known
time series, the reward multiplier $G(t)$. Even in this simple case, we can intuitively see how exploiting when the reward
multiplier is high and exploring when it is low can lead to better performance. The reward multiplier approach is general,
and can encode “micro-lock” periods (i.e., playing the same arm many times in one round) or “lock” periods (i.e., playing the
same arm in a sequence of rounds) where the prices cannot change, but the purchase rate has dynamically changing trends. As a
result of the reward multiplier function, theoretical regret bound analysis of the multi-armed bandit problem becomes more
complicated, because now the distribution of rewards depends explicitly on time. We care not only about how many times each
suboptimal arm is played, but also about exactly when it is played. For instance, if suboptimal arms are played only when the reward
multiplier is low, intuitively it should not hurt the overall regret; this is one insight that the classic analysis cannot provide.
The standard algorithms cannot be used in the setting in which the rewards are altered by the multiplier function
(i.e. in case of “micro-lock” times), because they would incorrectly estimate the mean rewards of the arms.

The main contributions of this work are: 

• A new framework that illustrates when it is beneficial to stop exploration sometimes to favor exploitation; 

• Novel algorithms that show how to adapt existing policies to regulate greed over time:

- Algorithm 1: ε-greedy algorithm with regulating threshold (Section 3.2);

- Algorithm 2: soft ε-greedy algorithm (Section 3.3);

- Algorithm 3: UCB algorithm with regulating threshold (Section 3.4);

- Algorithm 4: soft UCB algorithm (Section 3.5).

• Theoretical regret bounds for the above algorithms;

• Numerical comparisons (in Section 4) with improved versions of the classic ε-greedy algorithm and UCB
algorithm.

The new algorithms, which take advantage of the shape of the multiplier function $G(t)$, are not a simple extension of the
ε-greedy algorithm and the UCB algorithm. The new algorithms perform substantially better than even “smarter” versions of
the ε-greedy algorithm and the UCB algorithm (see Section 4), in which the rewards obtained
at each round are properly calculated. This is because the algorithms we propose regulate greed over time.

The multiplier function is assumed to be known and bounded. If the function $G(t)$ is not known in advance, it may be
easy to predict. For example, it may be easy to estimate the interest in a specific product at a particular day and time. Figure
1 shows a very clear trend for Google searches for “strawberries”.
Figure 1: Google searches for “strawberries”. Source: Google trends. [Plot; horizontal axis: years 2005 to 2015.]

Moreover, it would also be possible to estimate $G(t)$ at each step $t$ of the algorithm, since knowledge of the function
in future steps is not necessary. Figure 2 shows the trend of Google searches for “scarf” in the United States. The higher
demand for scarves in 2014 was predictable in the short term due to the particularly cold winter of that year.

Figure 2: Google searches for “scarf”. Source: Google trends.



The ideas in this paper were inspired by a high scoring entry in the Exploration and Exploitation 3 Phase 1 data mining
competition, where the goal was to build a better recommendation system for Yahoo! Front Page news articles. At each time,
several articles were available to choose from, and these articles would appear only for short time periods and would never
be available again. One of the main ideas in this entry was simple yet effective: if any article gets more than 9 clicks out of
the last 100 times it is shown, keep displaying it until the clickthrough rate goes down. This alone increased the
clickthrough rate by almost a quarter of a percent. The way we distilled the problem within this paper allows us to isolate and
study this effect.


2 Related Work 


The setup of this work differs from other works considering time-dependent multi-armed bandit problems: we do not assume
that the mean rewards of the arms exhibit random changes over time, and we assume that the reward multiplier is known in
advance, in accordance with what we observe in real scenarios. No previous works that we know of consider regulating greed
over time based on known reward trends. In contrast, Liu et al. (2013) consider a problem where each arm transitions in an
unknown Markovian way to a different reward state when it is played, and evolves according to an unknown random process
when it is not played. Garivier and Moulines (2008) presented an analysis of a discounted version of the UCB and a sliding
window version of the UCB, where the distribution of rewards can have abrupt changes and stays stationary in between.
Besbes et al. (2014) consider the case where the mean rewards for each arm can change, where the variation of that change is
bounded. Slivkins and Upfal (2007) consider an extreme case where the rewards exhibit Brownian motion, leading to regret
bounds that scale differently than typical bounds (linear in T rather than logarithmic). One of the works that is extremely
relevant to ours is that of Chakrabarti et al. (2009), who consider “mortal bandits” that disappear, just like Yahoo! Front Page
news articles or online advertisements. The setting of mortal bandits in the Yahoo! dataset inspired the framework proposed
here, since in the mortal setting, we claim one would want to curb exploration when a high reward mortal bandit is available.

An interesting setting is discussed by Komiyama et al. (2013), where there are lock-up periods when one is required to play
the same arm several times in a row. In our scenario, the micro-lock-up periods occur at each step of the game, and their
effective lengths are given by $G(t)$. In some of the algorithms that we propose, when the reward multiplier is above a
certain threshold, we create lock-up periods ourselves, where exploration is stopped, but the best arm is allowed to change
during high-reward zones as we gather information over rounds. The methods proposed in this work are contextual [e.g., Li et al.,
2010], in the sense that we consider externally available information in the form of time dependent trends.


3 Algorithms for regulating greed over time 

This section illustrates the problem, the proposed algorithms to regulate greed over time, and theoretical results on the bound 
on the expected regret of each policy. 

3.1 Problem setup 

Formally, the stochastic multi-armed bandit problem with regulated greed is a game played in $n$ rounds. At each round $t$
the player chooses an action among a finite set of $m$ possible choices called arms (for example, they could be ads shown
on a website, recommended videos and articles, or prices). When arm $j$ is played ($j \in \{1, \dots, m\}$), an unscaled random
reward $X_j(t)$ is drawn from an unknown distribution and the player receives the scaled reward $X_j(t) G(t)$, where $G(t)$ is the
multiplier function. The distribution of $X_j(t)$ does not change with time (the index $t$ is just used to indicate in which turn
the reward was drawn), while $G(t)$ is a known function of time assumed to be bounded (this is, for instance, the number of
searches for a particular item on Google or the number of users on Skype). At each turn, the player also suffers a possible
regret from not having played the best arm; the mean regret for having played arm $j$ is given by $\Delta_j = \mu_* - \mu_j$, where $\mu_*$ is
the mean reward of the best arm (indicated by “*”) and $\mu_j$ is the mean reward obtained when playing arm $j$. At the end of
each turn the player can update her estimate of the mean reward of arm $j$:

$$\widehat{X}_j = \frac{1}{T_j(t-1)} \sum_{s=1}^{T_j(t-1)} X_j(s), \tag{1}$$

where $T_j(t-1)$ is the number of times arm $j$ has been played before round $t$ starts and $X_j(s)$ is the unscaled reward of its $s$-th
play. This update will hopefully help the player in choosing a good arm in the next round. The total regret at the end of the game is given by

$$R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} G(t)\, \Delta_j\, \mathbb{1}_{\{I_t = j\}}, \tag{2}$$

where $\mathbb{1}_{\{I_t = j\}}$ is an indicator function equal to 1 if arm $j$ is played at time $t$ (otherwise its value is 0). The strategies presented
in the following sections aim to minimize the expected cumulative regret by regulating exploitation (i.e., greed) of the
best arm found so far, and exploration based on the values of the multiplier function $G(t)$. In general, when the multiplier
function is high, the player risks incurring high regret if a bad arm is played. We show that it is beneficial to stop exploration
in this situation and resume exploration when rewards and regrets are lower. A complete list of the symbols used throughout
the paper can be found in Appendix D.
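As a minimal sketch of the bookkeeping in this setup, the following Python snippet keeps the mean estimate on the unscaled rewards and accumulates the total regret, in which every play of arm $j$ at round $t$ contributes $G(t)\Delta_j$. The function names `update_mean` and `total_regret` and the toy numbers are illustrative assumptions, not part of the paper.

```python
def update_mean(old_mean, count, unscaled_reward):
    """Incremental sample mean of the unscaled rewards of one arm.

    `count` is the number of plays of the arm before this one.
    """
    return (old_mean * count + unscaled_reward) / (count + 1)


def total_regret(gaps, multipliers, plays):
    """Total regret: sum over rounds of G(t) * Delta_{I_t}.

    gaps[j]        -- Delta_j = mu_star - mu_j (0 for the best arm)
    multipliers[t] -- G(t) for round t (0-indexed in this sketch)
    plays[t]       -- index I_t of the arm played in round t
    """
    return sum(gaps[arm] * g for arm, g in zip(plays, multipliers))


# Toy example (assumed numbers): arm 0 is optimal, arm 1 has gap 0.5.
gaps = [0.0, 0.5]
multipliers = [1.0, 10.0, 1.0, 10.0]

# Same number of suboptimal plays, but placed at different times.
explore_when_low = total_regret(gaps, multipliers, [1, 0, 1, 0])
explore_when_high = total_regret(gaps, multipliers, [0, 1, 0, 1])
print(explore_when_low, explore_when_high)
```

Both schedules play the suboptimal arm twice, yet the regret differs by the ratio of the multipliers at the rounds where exploration happens, which is why the timing of exploration matters in this setting.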

3.2 Regulating greed with threshold in the ε-greedy algorithm

In Algorithm 1 we present a variation of the ε-greedy algorithm of Auer et al. (2002), in which a threshold $z$ has been
introduced in order to regulate greed. An optimal threshold $z$ can also be estimated by running the algorithm on past data and
selecting the value that gives the lowest regret. At each turn $t$, when the rewards are “high” (i.e., the multiplier $G(t)$ is
above the threshold $z$) the algorithm exploits the best arm found so far, that is, the arm $j$ with the highest mean estimate given in
equation (1). When the rewards are “low” (i.e., the multiplier $G(t)$ is under the threshold $z$), the algorithm explores with
probability $\varepsilon_t = \min\{1, km/\tilde{t}\}$ an arm at random (each arm has probability $1/m$ of being selected). The number $\tilde{t}$ counts how
many times the multiplier function has been under the threshold up to time $t$, while the constant $k$ is greater than 10 and such
that $k > 4/\min_j \Delta_j^2$. The reason for this choice is clear from the expression of $\beta_j(\tilde{t})$, which is a bound on the probability
of incorrectly considering a suboptimal arm $j$ to be the best choice. By setting the parameter $k$ accordingly, we can ensure
the logarithmic bound on the expected cumulative regret over the number of rounds (because the $\varepsilon_t$ are $\Theta(1/\tilde{t})$ and their sum
over time is logarithmically bounded, while the $\beta_j(\tilde{t})$ term is $o(1/\tilde{t})$).

The following theorem provides a bound on the mean regret of this policy (the proof is given in Appendix A).
Algorithm 1: ε-greedy algorithm with regulating threshold
Input: number of rounds $n$, number of arms $m$, threshold $z$, a constant $k > 10$ such that $k > 4/\min_j \Delta_j^2$,
       sequences $\{\varepsilon_t\}_{t=1}^{n} = \min\{1, km/\tilde{t}\}$ and $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\widehat{X}_j$ (defined in (1)) for each $j = 1, \dots, m$
for $t = m + 1$ to $n$ do
    if $G(t) < z$ then
        with probability $\varepsilon_t$ play an arm uniformly at random (each arm has probability $1/m$ of being selected),
        otherwise (with probability $1 - \varepsilon_t$) play arm $j$ such that $\widehat{X}_j \ge \widehat{X}_i \ \forall i$;
    else
        play arm $j$ such that $\widehat{X}_j \ge \widehat{X}_i \ \forall i$;
    end
    Get reward $G(t) X_j$;
    Update $\widehat{X}_j$;
end
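The steps of Algorithm 1 can be sketched in Python as follows. This is a simulation sketch under stated assumptions, not the authors' implementation: the function name `eps_greedy_threshold`, the Gaussian reward model, the toy arm means, and the value `k=11` are all illustrative choices, and the exploration probability is taken to be $\min\{1, km/\tilde{t}\}$ with $\tilde{t}$ counting the low-multiplier rounds so far.

```python
import math
import random


def eps_greedy_threshold(means, G, z, k, n, sigma=0.05, seed=0):
    """Sketch of the thresholded epsilon-greedy policy.

    means -- true mean of the unscaled reward of each arm (simulation only)
    G     -- function t -> multiplier G(t), for rounds t = 1..n
    z     -- threshold: exploration is allowed only when G(t) < z
    k     -- constant in eps_t = min(1, k * m / t_low)
    Returns (plays, mean_estimates).
    """
    rng = random.Random(seed)
    m = len(means)
    est = [0.0] * m      # mean estimates of the unscaled rewards
    count = [0] * m      # number of plays per arm
    plays = []
    t_low = 0            # rounds so far with G(t) below the threshold

    def play(j):
        x = rng.gauss(means[j], sigma)   # unscaled reward X_j(t)
        # The player receives G(t) * x; the estimate is updated with x itself.
        est[j] = (est[j] * count[j] + x) / (count[j] + 1)
        count[j] += 1
        plays.append(j)

    for j in range(m):                   # initialization: play each arm once
        play(j)

    for t in range(m + 1, n + 1):
        greedy = max(range(m), key=lambda i: est[i])
        if G(t) < z:
            t_low += 1
            eps = min(1.0, k * m / t_low)
            j = rng.randrange(m) if rng.random() < eps else greedy
        else:
            j = greedy                   # high-reward round: pure exploitation
        play(j)
    return plays, est


def wave(t):
    """Assumed multiplier for the demo: a sine wave around 21."""
    return 21 + 20 * math.sin(0.25 * t)


plays, est = eps_greedy_threshold([0.2, 0.5, 0.8], wave, z=21, k=11, n=2000)
print(plays.count(2), [round(e, 2) for e in est])
```

Setting the threshold at the midline of the wave means roughly half of the rounds are exploitation-only; random exploration is confined to the troughs.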


Theorem 3.1 (ε-greedy algorithm with hard threshold). The bound on the mean regret at time $n$ is given by

$$\mathbb{E}[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j \tag{3}$$

$$+ \sum_{t=m+1}^{n} G(t)\, \mathbb{1}_{\{G(t) < z\}} \sum_{j=1}^{m} \Delta_j \left( \varepsilon_t \frac{1}{m} + (1 - \varepsilon_t)\, \beta_j(\tilde{t}) \right) \tag{4}$$

$$+ \sum_{t=m+1}^{n} G(t)\, \mathbb{1}_{\{G(t) \ge z\}} \sum_{j=1}^{m} \Delta_j\, \beta_j(\tilde{t}), \tag{5}$$

where

$$\beta_j(\tilde{t}) = k \left( \frac{\tilde{t}}{mke} \right)^{-k/10} \log\left( \frac{\tilde{t}}{mke} \right) + \frac{4}{\Delta_j^2} \left( \frac{\tilde{t}}{mke} \right)^{-k\Delta_j^2/4}. \tag{6}$$

The sum in (3) is the exact mean regret during the initialization phase of Algorithm 1. In (4) we have a bound on the
expected regret for turns that present low values of $G(t)$, where the quantity in the parenthesis is the bound on the probability
of playing arm $j$: $\beta_j(\tilde{t})$ is the bound on the probability that arm $j$ is considered to be the best arm at round $t$, and $1/m$ is the
probability of choosing arm $j$ when the choice is made at random. Finally, in (5) we have a bound on the expected regret for
turns that present high values of $G(t)$, and in this case we consider only the probability $\beta_j(\tilde{t})$ that arm $j$ is the best arm, since
we do not explore at random during high reward periods. The usual ε-greedy algorithm is a special case when $G(t) = 1 \ \forall t$
and $z > 1$. Notice that $\varepsilon_t$ is a quantity $\Theta(1/\tilde{t})$, while $\beta_j(\tilde{t})$ is $o(1/\tilde{t})$, so that an asymptotic logarithmic bound in $n$ holds for
$\mathbb{E}[R_n]$ if $\tilde{t}$ grows at the same rate as $t$ (because of the logarithmic bound on the harmonic series).

We want to compare this bound with the one of the usual version of the ε-greedy algorithm but, since the old version is
not well suited for the setting in which the rewards are altered by the multiplier function, we discount the rewards obtained
at each round (by simply dividing them by $G(t)$) so that it can also produce accurate estimates of the mean reward for each
arm. This “smarter” version of the ε-greedy algorithm is presented in Section 4. The bound on the probability
of playing a suboptimal arm $j$ for the usual ε-greedy algorithm is given by $\beta_j(t)$ (i.e. $\beta_j(\tilde{t})$ when $\tilde{t} = t$) and we refer to it as
$\beta_j^{old}(t)$. In general, $\beta_j^{old}(t)$ is lower than $\beta_j(\tilde{t})$ (since $\tilde{t} \le t$). Intuitively, this reflects the fact that the new algorithm performs
fewer exploration steps. Moreover, in the usual ε-greedy algorithm, the probability of choosing arm $j$ at time $t$ is given by

$$\mathbb{P}(\{I_t^{old} = j\}) = \varepsilon_t^{old} \frac{1}{m} + (1 - \varepsilon_t^{old})\, \beta_j^{old}(t),$$

which is less than the probability of the new algorithm in case of low $G(t)$,

$$\mathbb{P}(\{I_t^{new} = j\}) = \varepsilon_t \frac{1}{m} + (1 - \varepsilon_t)\, \beta_j(\tilde{t}),$$

but can easily be higher than the probability of the new algorithm in case of high rewards (which is given by only $\beta_j(\tilde{t})$). In
fact, in a high-reward round,

$$\mathbb{P}(\{I_t^{old} = j\}) - \mathbb{P}(\{I_t^{new} = j\}) = \varepsilon_t^{old} \frac{1}{m} + (1 - \varepsilon_t^{old})\, \beta_j^{old}(t) - \beta_j(\tilde{t}).$$

If $t > km$ we get

$$\frac{k}{t} + \left(1 - \frac{km}{t}\right) \beta_j^{old}(t) - \beta_j(\tilde{t}), \tag{7}$$

and if $t < km$ we get

$$\frac{1}{m} - \beta_j(\tilde{t}), \tag{8}$$

and for $t$ large enough both expressions are positive since $\beta_j(\tilde{t})$ is $o(1/t)$ and we assume that $\tilde{t}$ is $\Theta(t)$. Having (7) and
(8) positive means that if we are in a high-reward period the probability of choosing a suboptimal arm decreases faster in
Algorithm 1. In that case, Algorithm 1 would have lower regret than the ε-greedy algorithm.

In practice, the threshold $z$ should be defined as $\arg\min_z \mathbb{E}[R_n]$. If this is too computationally challenging, but past data
are available, a good value for $z$ can be chosen using cross validation techniques, i.e. by trying different thresholds with the
available data and by choosing the one that yields the best performance.

The following corollary illustrates the benefits of the bound in a simple scenario where the multiplier function can only
take two values and the regulating threshold separates the higher value from the lower one.

Corollary 3.1. Suppose the greed function $G(t)$ takes only two values: $g_{low}$ and $g_{high}$. At each turn $t$ it has taken the value
$g_{low}$ for a fraction $q$ of the turns played, and the value $g_{high}$ for the remaining $t - qt$ turns (for example, if $q = 1/2$,
$G(t)$ alternates at each turn between $g_{low}$ and $g_{high}$). Then, the bound on the expected regret at turn $n$ reduces to:

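$$\mathbb{E}[R_n] \le O(1) + \frac{k}{q}\, \Delta_{tot}\, g_{low} \sum_{t=m+1}^{n} \frac{1}{t}\, \mathbb{1}_{\{G(t) = g_{low}\}} + \Delta_{tot}\, g_{high} \sum_{t=m+1}^{n} o\!\left(\frac{1}{t}\right),$$

where $\Delta_{tot} = \sum_{j=1}^{m} \Delta_j$.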


The term that hurts regret the most ($1/t$) is multiplied only by $g_{low}$, and not by $g_{high}$. When the rewards are high (and so is
the possible regret), only terms of order $o(1/t)$ are present. If exploration were permitted during the high reward zone, there
would have been large terms of order $g_{high}/t$, which is what the algorithm is designed to avoid.



3.3 Soft ε-greedy algorithm

We present in Algorithm 2 a “soft version” of the ε-greedy algorithm where greed is regulated gradually (in contrast with the
hard threshold of the previous section). Again, in high reward zones, exploitation will be preferred, while in low reward zones
the algorithm will explore the arms more. Let us define the following function



(9) 


and let $\gamma = \min_{s \in \{m+1, \dots, n\}} \psi(s)$. Notice that $0 < \psi(t) \le 1 \ \forall t$ and that its values are close to 0 when $G(t)$ is high, while
they are close to 1 for low values of $G(t)$. The new probabilities of exploration during the game are given at each turn $t$ by
$\varepsilon_t = \min\{\psi(t), km/t\}$. In this way, we still maintain the linear decay of the probabilities of exploration, but we push them to
zero to avoid high regrets when the multiplier function $G(t)$ is high. We generally assume that $\min_{s \in \{m+1, \dots, n\}} G(s)$ is not
smaller than 1. The usual case is recovered when $G(t) = 1$ for all $t$.
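To make the soft exploration schedule concrete, here is a small Python sketch, assuming the exploration probability is the minimum of $\psi(t)$ and $km/t$. The particular `psi` below is only an assumed stand-in that has the properties stated in the text (equal to 1 at the minimum of $G$, close to 0 when $G$ is high); the names `psi` and `soft_epsilon`, and the values of `k` and `m`, are hypothetical.

```python
import math


def psi(g, g_min):
    """Assumed stand-in for psi(t): 1 when g == g_min, tends to 0 as g grows.

    This functional form is an illustrative choice, not taken from the text.
    """
    return math.log(1 + 1 / g) / math.log(1 + 1 / g_min)


def soft_epsilon(t, G, g_min, k, m):
    """Exploration probability eps_t = min(psi(t), k * m / t)."""
    return min(psi(G(t), g_min), k * m / t)


def G(t):
    """Sine-shaped multiplier with minimum value 1 (assumed for the demo)."""
    return 20 + 19 * math.sin(t / 2)


g_min = 1.0          # lower bound of G above: 20 - 19
k, m = 11, 3         # assumed constants for the demo
eps = [soft_epsilon(t, G, g_min, k, m) for t in range(1, 2001)]
print(min(eps), max(eps))
```

Early in the game the cap $km/t$ exceeds 1, so the schedule follows `psi` and exploration is already suppressed at the peaks of the multiplier; later the $km/t$ term takes over.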


Algorithm 2: Soft ε-greedy algorithm
Input: number of rounds $n$, number of arms $m$, a constant $k > 10$ such that $k > 4/\min_j \Delta_j^2$,
       sequences $\{\varepsilon_t\}_{t=1}^{n} = \min\{\psi(t), km/t\}$ and $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\widehat{X}_j$ (defined in (1)) for each $j = 1, \dots, m$
for $t = m + 1$ to $n$ do
    with probability $\varepsilon_t$ play an arm uniformly at random (each arm has probability $1/m$ of being selected),
    otherwise (with probability $1 - \varepsilon_t$) play arm $j$ such that $\widehat{X}_j \ge \widehat{X}_i \ \forall i$;
    Get reward $G(t) X_j$;
    Update $\widehat{X}_j$;
end


The following theorem (proved in Appendix B) shows that a logarithmic bound holds in this case too (because the $\varepsilon_t$ are
$\Theta(1/t)$ and their sum over time is logarithmically bounded, while the $\beta_j^S(t)$ term is $o(1/t)$).


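Theorem 3.2 (Regret-bound for soft-ε-greedy algorithm). The bound on the mean regret at time $n$ is given by

$$\mathbb{E}[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j \tag{10}$$

$$+ \sum_{t=m+1}^{n} G(t) \sum_{j=1}^{m} \Delta_j \left( \varepsilon_t \frac{1}{m} + (1 - \varepsilon_t)\, \beta_j^S(t) \right), \tag{11}$$

where

$$\beta_j^S(t) = k \left( \frac{\gamma t}{mke} \right)^{-k/10} \log\left( \frac{\gamma t}{mke} \right) + \frac{4}{\Delta_j^2} \left( \frac{\gamma t}{mke} \right)^{-k\Delta_j^2/4}. \tag{12}$$

The sum in (10) is the exact mean regret during the initialization of Algorithm 2. For the rounds after the initialization
phase, the quantity in the parenthesis of (11) is the bound on the probability of playing arm $j$ (where $\beta_j^S(t)$ is the bound on
the probability that arm $j$ is the best arm at round $t$, and $1/m$ is the probability of choosing arm $j$ when the choice is made at
random).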

As before, we want to compare this bound with the “smarter” version of the ε-greedy algorithm presented in Section 4.
In the usual ε-greedy algorithm, after the “critical time” $n' = km$, the probability $\mathbb{P}(\widehat{X}_{j,T_j(t-1)} \ge \widehat{X}_{*,T_*(t-1)})$ of arm
$j$ being the current best arm can be bounded by a quantity that is $o(1/t)$ as $t$ grows. Before time $n'$, the decay of
$\mathbb{P}(\widehat{X}_{j,T_j(t-1)} \ge \widehat{X}_{*,T_*(t-1)})$ is faster and the bound is a quantity that is $o(1/t^{\lambda})$, $\forall \lambda$, as $t$ grows (see the remark in Appendix
A). The probability of choosing a suboptimal arm $j$ changes as follows:

• if $t < n'$, $\mathbb{P}(\{I_t = j\}) = \frac{1}{m}$;

• if $t > n'$, $\mathbb{P}(\{I_t = j\}) = \frac{k}{t} + \left(1 - \frac{km}{t}\right) \beta_j^{old}(t)$, which is $\Theta(1/t)$ as $t$ grows.

In the soft-ε-greedy algorithm, before time $w$ defined as $w = \arg\min f(s)$, subject to $f(s) < \gamma$, where $f(s) = km/s$, we have
that $\beta_j^S(t)$, which is the bound on the probability $\mathbb{P}(\widehat{X}_{j,T_j(t-1)} \ge \widehat{X}_{*,T_*(t-1)})$ of arm $j$ being the current best arm, is a quantity
that is $o(1/(\gamma t)^{\lambda})$, $\forall \lambda$, as $t$ grows (the argument is similar to the remark in Appendix A). After $w$, it can be bounded by a
quantity that is $o(1/(\gamma t))$ as $t$ grows. The probability of choosing a suboptimal arm $j$ changes as follows:

• if $t < n'$, $\mathbb{P}(\{I_t = j\}) = \frac{\psi(t)}{m} + (1 - \psi(t))\, \beta_j^S(t)$;

• if $n' < t < w$, $\mathbb{P}(\{I_t = j\}) = \frac{1}{m} \min\left\{\psi(t), \frac{km}{t}\right\} + \left(1 - \min\left\{\psi(t), \frac{km}{t}\right\}\right) \beta_j^S(t)$;

• if $t > w$, $\mathbb{P}(\{I_t = j\}) = \frac{k}{t} + \left(1 - \frac{km}{t}\right) \beta_j^S(t)$.


In order to interpret these quantities, let us see what happens for high or low values of the multiplier $G(t)$ as $t$ grows in Table
1. For brevity, we abuse notation when using Landau's symbols, because in some cases $t$ is not allowed to go to infinity; it
is convenient to still use the “little o” notation to compare the decay rates of the probabilities of choosing a suboptimal arm,
which also gives a qualitative explanation of what happens when using the algorithms. For the soft-ε-greedy algorithm, the rate at
which the probability of choosing a suboptimal arm decays is faster when $G(t)$ is high, and slower when $G(t)$ is low. Notice
that the parameter $\gamma$ slows down the decay with respect to the usual ε-greedy algorithm. This is a direct consequence of the
slower exploration. An example of a typical behavior of $\psi(t)$ and $\varepsilon_t^{old}$ is shown in Figure 3, where $G(t) = 20 + 19\sin(t/2)$.

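Table 1: Summary of the decay rate of the probabilities of choosing a suboptimal arm for the soft-ε-greedy algorithm and the
usual ε-greedy algorithm (supposing it takes the time patterns into account). The decay depends on the time-regions of the
game presented in Figure 3.

| Region | Round        | $G(t)$ | $\mathbb{P}(\{I_t = j\})^{soft}$    | $\mathbb{P}(\{I_t = j\})^{old}$ | $\mathbb{P}^{soft} < \mathbb{P}^{old}$? |
|--------|--------------|--------|-------------------------------------|---------------------------------|------------------------------------------|
| 1      | $t < n'$     | high   |                                     | $1/m$                           | yes, much better                         |
| 1      | $t < n'$     | low    | close to $1/m$                      | $1/m$                           | no, but not by much                      |
| 2      | $n' < t < w$ | high   |                                     | $\Theta(1/t)$                   | yes, much better                         |
| 2      | $n' < t < w$ | low    |                                     | $\Theta(1/t)$                   | yes, but not by much                     |
| 3      | $t > w$      | high   | $\Theta(1/t) + o(1/(\gamma t))$     | $\Theta(1/t)$                   | no, but not by much                      |
| 3      | $t > w$      | low    | $\Theta(1/t) + o(1/(\gamma t))$     | $\Theta(1/t)$                   | no, but not by much                      |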

Figure 3: Comparison of probabilities of exploration over the number of rounds. Before $n'$, $\varepsilon_t^{old}$ is 1 and always greater than
$\psi(t)$. After $w$, $\varepsilon_t^{old}$ is always less than $\psi(t)$.
3.4 Regulating greed in the UCB algorithm

Following what has been presented to improve the ε-greedy algorithm in this setting, we introduce in Algorithm 3 a modification
of the UCB algorithm. We again set a threshold $z$ and, if the multiplier of the rewards $G(t)$ is above this level, the new
algorithm exploits the best arm. When $G(t)$ is under the threshold, the algorithm plays the arm with the highest
upper confidence bound on the mean estimate.


Algorithm 3: UCB algorithm with regulating threshold
Input: number of rounds $n$, number of arms $m$, threshold $z$, sequence $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\widehat{X}_j$ (as defined in (1)) for each $j = 1, \dots, m$
for $t = m + 1$ to $n$ do
    if $G(t) < z$ then
        play arm $j$ with the highest upper confidence bound on the mean estimate,
        $\widehat{X}_{j,T_j(t-1)} + \sqrt{\frac{2 \log t}{T_j(t-1)}}$;
    else
        play arm $j$ such that $\widehat{X}_j \ge \widehat{X}_i \ \forall i$;
    end
    Get reward $G(t) X_j$;
    Update $\widehat{X}_j$;
end
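A simulation sketch of the thresholded UCB policy is given below, assuming the standard UCB1 confidence radius $\sqrt{2 \log t / T_j(t-1)}$ in low-multiplier rounds and plain greedy selection otherwise. The function name `ucb_threshold`, the Gaussian reward model, and the toy arm means are illustrative assumptions.

```python
import math
import random


def ucb_threshold(means, G, z, n, sigma=0.05, seed=0):
    """Sketch of UCB with a regulating threshold. Returns the arms played."""
    rng = random.Random(seed)
    m = len(means)
    est = [0.0] * m      # mean estimates of the unscaled rewards
    count = [0] * m      # number of plays per arm
    plays = []

    def play(j):
        x = rng.gauss(means[j], sigma)   # unscaled reward; payoff is G(t) * x
        est[j] = (est[j] * count[j] + x) / (count[j] + 1)
        count[j] += 1
        plays.append(j)

    for j in range(m):                   # initialization: play each arm once
        play(j)

    for t in range(m + 1, n + 1):
        if G(t) < z:
            # Low-reward round: pick the highest upper confidence bound.
            j = max(range(m),
                    key=lambda i: est[i] + math.sqrt(2 * math.log(t) / count[i]))
        else:
            # High-reward round: exploit the best mean estimate.
            j = max(range(m), key=lambda i: est[i])
        play(j)
    return plays


def wave(t):
    """Assumed multiplier for the demo: a sine wave around 21."""
    return 21 + 20 * math.sin(0.25 * t)


ucb_plays = ucb_threshold([0.2, 0.5, 0.8], wave, z=21, n=2000)
print([ucb_plays.count(j) for j in range(3)])
```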


It is possible to prove that also in this case the regret can be bounded logarithmically in $n$. Let $B = \{t : G(t-1) < z,\ G(t) \ge z\}$
be the set of rounds where the high-reward zone is entered, and let $\tau_t$ be the last round of the high-reward
zone that was entered at time $t$. Let us call $y_1, y_2, \dots, y_{|B|}$ the elements of $B$ and order them in increasing order such that
$y_1 < y_2 < \dots < y_{|B|}$. Let us also define for every $k \in \{1, \dots, |B|\}$ the set $Y_k = \{t : t \ge y_k,\ G(t) \ge z,\ t < y_{k+1}\}$ (where
$y_{|B|+1} = n$) of times in the high-reward period entered at time $y_k$, and let $\Lambda_k = \max_{t \in Y_k} G(t)$ be the highest value of $G(t)$ on
$Y_k$. Finally, for every $k$, let $R_k = \Lambda_k |Y_k|$.

Now, given a game of $n$ total rounds, we can “collapse” the $k$th high reward zone into the entering time $y_k$ by defining
$G(y_k) = R_k$, for all $k$. Now, the maximum regret over $B$ is given by $(\max_j \Delta_j) \sum_{k=1}^{|B|} R_k$. By eliminating the set $B$ from
the game, we have transformed the original game into a shorter one, with $\eta$ steps, where $G(t)$ is bounded by $z$ and the usual
UCB algorithm is played. When the size of set $B$ decreases with $n$ (is of order $O(1/t)$ after an arbitrary time), the total regret
has a logarithmic bound in $n$.
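The collapsing construction can be illustrated with a short Python sketch that scans a multiplier sequence, finds each entry into a high-reward zone, and computes the zone length, its peak value, and their product. The function name `collapse_high_zones` and the toy sequence are hypothetical, and a round is treated as high-reward when its multiplier is at least the threshold, which is an assumption about the boundary case.

```python
def collapse_high_zones(G_values, z):
    """Return (y_k, |Y_k|, Lambda_k, R_k) for each high-reward zone.

    G_values[t] is G(t) (0-indexed in this sketch). A zone is entered at
    round t when G(t-1) < z <= G(t), and lasts while G stays at or above z.
    """
    zones = []
    t, n = 1, len(G_values)
    while t < n:
        if G_values[t - 1] < z <= G_values[t]:
            start = t
            while t < n and G_values[t] >= z:
                t += 1
            block = G_values[start:t]
            peak = max(block)                          # Lambda_k
            zones.append((start, len(block), peak, peak * len(block)))
        else:
            t += 1
    return zones


G_values = [1, 1, 5, 7, 1, 1, 6, 1]
zones = collapse_high_zones(G_values, z=4)
print(zones)

# Worst-case regret charged to the collapsed zones for an assumed max gap.
max_gap = 0.5
print(max_gap * sum(r for _, _, _, r in zones))
```

Charging each zone its peak multiplier times its length is what makes the bound conservative: every round in the zone is treated as if it had the largest multiplier seen in that zone.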

The ε-greedy methods are more amenable to this type of analysis than UCB methods, because the proofs require bounds
on the probability of choosing the wrong arm at each turn. The UCB proof instead requires us to bound the expected number
of times the suboptimal arms are played, without regard to when those arms were chosen. We were able to avoid using
the maximum of the $G(t)$ values in the ε-greedy proofs, but this is unavoidable in the UCB proofs without leaving terms
in the bound that cannot be explicitly calculated or simplified (an alternate proof would use weaker Central Limit Theorem
arguments).
Theorem 3.3 (Regret-bound for the regulated UCB algorithm). The bound on the mean regret at time $n$ is given by

$$\mathbb{E}[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j \tag{13}$$

$$+ z \left[ 8 \sum_{j : \mu_j < \mu_*} \frac{\log \eta}{\Delta_j} + \left(1 + \frac{\pi^2}{3}\right) \sum_{j=1}^{m} \Delta_j \right] \tag{14}$$

$$+ \left(\max_j \Delta_j\right) \sum_{k=1}^{|B|} R_k. \tag{15}$$

The first sum in (13) is the exact mean regret of the initialization phase of Algorithm 3, the third sum in (15) is the bound
on the regret from the high-reward zones that have been collapsed, and the second term in (14) is the bound on the regret for the $\eta$
rounds when $G(t)$ is under the threshold $z$; it follows from the usual bound on the UCB algorithm (for $n$ rounds the UCB
algorithm has a mean regret bounded by $8 \sum_{j : \mu_j < \mu_*} \frac{\log n}{\Delta_j} + \left(1 + \frac{\pi^2}{3}\right) \sum_{j=1}^{m} \Delta_j$).

Again, the threshold $z$ should be defined as $\arg\min_z \mathbb{E}[R_n]$ or, if past data are available, $z$ can be chosen using cross
validation.


3.5 The soft UCB algorithm 


In Algorithm 4 we now present a “soft version” of the UCB algorithm where greed is regulated gradually (in contrast with the
hard threshold of the previous section). Again, in high reward zones, exploitation will be preferred, while in low reward zones
the algorithm will explore the arms.

Let us define the following function:


m = 




(16) 


At each turn $t$ of the game, the algorithm plays the arm with the highest upper confidence bound on the mean estimate, but,
with the introduction of $\xi(t)$, the confidence interval around $\widehat{X}_{j,T_j(t-1)}$ is built in such a way that, when $G(t)$ is high, it
collapses on the estimate itself, forcing the player to choose the arm with the highest mean estimate (thus leading to a pure
exploitation policy). In contrast, when the multiplier $G(t)$ is low, the confidence interval around $\widehat{X}_{j,T_j(t-1)}$ stretches out,
making it easier for the player to explore arms with high uncertainty.

One of the main difficulties in formulating these bounds is to define a correct functional form for $\xi(t)$ so that it is
possible to obtain smoothness in the arm decision, reasonable Chernoff-Hoeffding inequality bounds while working out the
proof (see Appendix C), and a convergent series (the second summation in (18)).


Algorithm 4: Soft UCB algorithm
Input: number of rounds $n$, number of arms $m$, sequence $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\widehat{X}_j$ (as defined in (1)) for each $j = 1, \dots, m$
for $t = m + 1$ to $n$ do
    play arm $j$ with the highest upper confidence bound on the mean estimate,
    $\widehat{X}_{j,T_j(t-1)} + \sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}}$;
    Get reward $G(t) X_j$;
    Update $\widehat{X}_j$;
end
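The effect of the soft confidence radius can be seen in a few lines of Python. The function `xi` below is only an assumed stand-in with the behavior described in the text (its logarithm shrinks toward 0 when the multiplier is high, so the interval collapses onto the estimate); it is not claimed to be the paper's definition, and the names `xi` and `soft_ucb_index` are hypothetical.

```python
import math


def xi(t, g):
    """Assumed stand-in for xi(t): close to 1 when g is large relative to t."""
    return 1 + t / g


def soft_ucb_index(est, count, t, g):
    """Mean estimate plus the radius sqrt(2 * log(xi(t)) / T_j(t-1))."""
    return est + math.sqrt(2 * math.log(xi(t, g)) / count)


# Same arm statistics at round t = 100, under two multiplier values.
low_g = soft_ucb_index(est=0.5, count=10, t=100, g=1.0)
high_g = soft_ucb_index(est=0.5, count=10, t=100, g=1e6)
print(low_g, high_g)
```

With a low multiplier the index sits well above the estimate and rarely played arms look attractive; with a very high multiplier the index is essentially the estimate, so arm selection reduces to exploitation.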


Also in this case, it is possible to achieve a bound that grows logarithmically in n. 
Theorem 3.4 (Regret-bound for soft-UCB algorithm). The bound on the mean regret $\mathbb{E}[R_n]$ at time $n$ is given by

$$\mathbb{E}[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j \tag{17}$$

$$+ \max_{t} G(t) \left( \sum_{j : \mu_j < \mu_*} \frac{8}{\Delta_j} \log\left(\max_{t} \xi(t)\right) + \sum_{j=1}^{m} \Delta_j \left( 1 + \sum_{t=m+1}^{n} 2\, \xi(t)^{-4} (t - 1 - m)^2 \right) \right). \tag{18}$$

The first sum in (17) is the exact mean regret of the initialization phase of Algorithm 4. For the rounds after the
initialization phase, the mean regret is bounded by the quantity in (18), which is almost identical to the bound of the usual UCB
algorithm if we assume $G(t) = 1$ (i.e., rewards are not modified by the multiplier function).


4 Experimental results 

We consider three types of multiplier function G{t): 

• The Wave Greed (Figure 4(a)): in this case customers come in waves: $G(t) = 21 + 20\sin(0.25t)$. We want to exploit
the best arm found so far during the peaks, and explore the other arms during low-reward periods;

• The Christmas Greed (Figure 4(b)): again, $G(t) = 21 + 20\sin(0.25t)$, but when $t \in \{650, 651, \dots, 670\}$, $G(t) = 1000$,
which shows that there is a peak in the rewards offered by the game (which we call “Christmas”, in analogy to the
phenomenon of the boom of customers during the Christmas holidays);

• The Step Greed (Figure 4(c)): this case is similar to the Wave Greed case, but this time the function is not smooth:
$G(t) = 200$, but for $t \in \{600, 601, \dots, 800\} \cup \{1000, 1001, \dots, 1200\} \cup \{1400, 1401, \dots, 1600\}$ we have $G(t) = 400$.
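The three multiplier functions listed above can be written down directly. The following Python sketch uses hypothetical function names and assumes the third step interval runs over every round from 1400 to 1600.

```python
import math


def wave_greed(t):
    """Customers come in waves: G(t) = 21 + 20 sin(0.25 t)."""
    return 21 + 20 * math.sin(0.25 * t)


def christmas_greed(t):
    """Wave multiplier with a spike of 1000 for rounds 650 to 670."""
    return 1000 if 650 <= t <= 670 else wave_greed(t)


def step_greed(t):
    """Piecewise-constant multiplier: 400 inside three windows, 200 elsewhere."""
    high = (600 <= t <= 800) or (1000 <= t <= 1200) or (1400 <= t <= 1600)
    return 400 if high else 200


rounds = range(1, 2001)
print(min(wave_greed(t) for t in rounds), max(wave_greed(t) for t in rounds))
print(sum(1 for t in rounds if step_greed(t) == 400))
```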


Figure 4: Shapes of the multiplier functions used in the experiments. 


(a) The Wave Greed 


(b) The Christmas Greed 


(c) The Step Greed 



We consider a game with 500 arms and normally distributed rewards. Each arm $j \in \{1, \dots, 500\}$ has mean reward
$\mu_j = 0.1 + (200 + 1.5(500 - j + 1))/(1.5 \times 500)$ and common standard deviation $\sigma = 0.05$. The arms were chosen in this
way so that $X_j$ would take values (with high probability) in [0,1]. Having a bounded support for $X_j$ is a standard assumption
made when proving regret bounds (see Auer et al. (2002)). We play 2000 rounds each game. After 2000 rounds the algorithms
all essentially have determined which arm is the best and tend to perform very similarly from that point onwards.

The well-known UCB and ε-greedy algorithms are not suitable for the setting in which the rewards are altered by the multiplier function. Thus, in their current form, we cannot compare directly with them. The fact that rewards are multiplied would irremediably bias all the estimates of the mean rewards, leading UCB and ε-greedy to choose arms that look good just because they happened to be played in a high-reward period. For example, suppose we show an ad on a website at lunch time: many people will see it because at that time web surfing is at its peak (i.e., the G(t) multiplier is high). So even if the ad was bad, we may register more clicks than for a good ad shown at 3:00 AM (i.e., the G(t) multiplier is low). To obtain a fair comparison, we created “smarter” versions of the UCB and ε-greedy algorithms in which the rewards are discounted at each round (by simply dividing them by G(t)), so that the old versions of the algorithms can also produce accurate estimates of the mean reward for each arm. The smarter version of the usual UCB algorithm is presented in Algorithm 5, and the one for the ε-greedy algorithm is shown in Algorithm 6. For the three multiplier functions, we report the performance of the algorithms in Figures 5, 6, and 7.
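The bias described above can be illustrated with a small worked example. This is a hypothetical sketch: the two arms, their means (0.2 and 0.6), and the multiplier values are illustrative assumptions, and noise is omitted so that only the effect of G(t) is visible.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Assumed multiplier values at a peak and at a trough of the wave.
G_PEAK, G_TROUGH = 41.0, 1.0

# The bad arm (unscaled mean 0.2) happens to be played only at peaks,
# the good arm (unscaled mean 0.6) only at troughs.
observed_bad = [G_PEAK * 0.2] * 5
observed_good = [G_TROUGH * 0.6] * 5

# Estimates built from raw rewards rank the bad arm first.
raw_bad, raw_good = mean(observed_bad), mean(observed_good)

# Dividing each reward by the multiplier in force when it was earned
# recovers the unscaled means and the correct ranking.
norm_bad = mean([r / G_PEAK for r in observed_bad])
norm_good = mean([r / G_TROUGH for r in observed_good])
```

Here the raw averages are 8.2 for the bad arm against 0.6 for the good arm, while the normalized averages are 0.2 and 0.6.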

In Figures 8, 9, and 10 we change the rewards to have a Bernoulli distribution (the assumption of bounded support is verified). Similarly to the normal case, each arm $j \in \{1, \dots, 500\}$ has probability of success $p_j = 0.1 + (200 + 1.5(500 - j + 1))/(1.5 \times 500)$. One of the advantages of the ε-greedy algorithm is that there are no assumptions on the distribution of the rewards, while in UCB they need bounded support ([0,1] for convenience, so it is easier to use Hoeffding’s inequality).

Figure 5: Comparison for the Wave Greed case. (a) ε-greedy algorithms rewards. (b) UCB algorithms rewards. (c) Final rewards comparison.

Figure 6: Comparison for the Christmas Greed case. (a) ε-greedy algorithms rewards. (c) Final rewards comparison.

Figure 7: Comparison for the Step Greed case. (a) ε-greedy algorithms rewards. (b) UCB algorithms rewards. (c) Final rewards comparison.

Algorithm 5: Smarter version of the usual UCB algorithm
Input: number of rounds n, number of arms m, sequence $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\bar{X}_j$ (as defined earlier) for each $j = 1, \dots, m$
for t = m + 1 to n do
    play arm j with the highest upper confidence bound on the mean estimate:
        $\bar{X}_j + \sqrt{\frac{2\log(t)}{T_j(t-1)}}$
    Get reward $G(t)X_j$;
    Update $\bar{X}_j$;
end
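The steps of Algorithm 5 can be sketched in Python as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the function name, the Gaussian reward model with standard deviation `sigma`, and the seeding are hypothetical, and `G` is assumed to be a strictly positive function of the round index.

```python
import math
import random


def smarter_ucb(means, G, n, sigma=0.05, seed=0):
    """Run a normalized-reward UCB sketch; returns (total reward, pull counts)."""
    rng = random.Random(seed)
    m = len(means)
    sums = [0.0] * m    # running sums of rewards divided by G(t)
    counts = [0] * m    # T_j: number of pulls of each arm
    total = 0.0
    for t in range(1, n + 1):
        if t <= m:
            j = t - 1   # initialization: play every arm once
        else:
            # Arm with the highest upper confidence bound on the mean estimate.
            j = max(
                range(m),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        x = rng.gauss(means[j], sigma)   # unscaled reward X_j
        reward = G(t) * x                # observed reward G(t) X_j
        total += reward
        sums[j] += reward / G(t)         # normalize before updating the estimate
        counts[j] += 1
    return total, counts
```

For example, `smarter_ucb([0.2, 0.5, 0.8], lambda t: 21 + 20 * math.sin(0.25 * t), 2000)` plays a three-arm game under the Wave Greed multiplier. The exploration term does not depend on G(t), which is what distinguishes this baseline from the greed-regulating algorithms.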


Algorithm 6: Smarter version of the usual ε-greedy algorithm
Input: number of rounds n, number of arms m, a constant c > 10, a constant d such that $d < \min_j \Delta_j$ and 0 < d < 1, sequences $\varepsilon_t = \min\{1, \frac{cm}{d^2 t}\}$ and $\{G(t)\}_{t=1}^{n}$
Initialization: play all arms once and initialize $\bar{X}_j$ (as defined earlier) for each $j = 1, \dots, m$
for t = m + 1 to n do
    with probability $\varepsilon_t$ play an arm uniformly at random (each arm has probability $\frac{1}{m}$ of being selected),
    otherwise (with probability $1 - \varepsilon_t$) play arm j such that
        $\bar{X}_j \ge \bar{X}_i \quad \forall i$
    Get reward $G(t)X_j$;
    Update $\bar{X}_j$;
end
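Algorithm 6 can be sketched in the same style. The exploration schedule below, $\varepsilon_t = \min\{1, cm/(d^2 t)\}$, is an assumption that follows the schedule of Auer et al. (2002); the function name, the Gaussian reward model, and the default constants are hypothetical.

```python
import random


def smarter_eps_greedy(means, G, n, c=11.0, d=0.25, sigma=0.05, seed=0):
    """Run a normalized-reward epsilon-greedy sketch; returns (total reward, pull counts)."""
    rng = random.Random(seed)
    m = len(means)
    sums = [0.0] * m    # running sums of rewards divided by G(t)
    counts = [0] * m    # number of pulls of each arm
    total = 0.0
    for t in range(1, n + 1):
        if t <= m:
            j = t - 1   # initialization: play every arm once
        else:
            eps = min(1.0, c * m / (d * d * t))   # assumed exploration schedule
            if rng.random() < eps:
                j = rng.randrange(m)              # explore uniformly at random
            else:
                j = max(range(m), key=lambda a: sums[a] / counts[a])  # exploit
        x = rng.gauss(means[j], sigma)   # unscaled reward X_j
        reward = G(t) * x                # observed reward G(t) X_j
        total += reward
        sums[j] += reward / G(t)         # normalize before updating the estimate
        counts[j] += 1
    return total, counts
```

With three arms and the default constants, exploration is certain for roughly the first 528 rounds and then decays as 1/t, independently of G(t).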
Figure 8: Comparison for the Wave Greed case. (a) ε-greedy algorithms rewards. (b) UCB algorithms rewards. (c) Final rewards comparison.

Figure 9: Comparison for the Christmas Greed case. (a) ε-greedy algorithms rewards. (b) UCB algorithms rewards. (c) Final rewards comparison.

Figure 10: Comparison for the Step Greed case. (a) ε-greedy algorithms rewards. (c) Final rewards comparison.


4.1 Discussion on Yahoo! contest 

The motivation of this work comes from a high-scoring entry in the Exploration and Exploitation 3 contest, where the goal was to build a better recommender system for Yahoo! Front Page news article recommendations. The contest data, which was from Yahoo! and allows for unbiased evaluations, is described by Li et al. (2010). These data had several challenging characteristics, including broad trends over time in click-through rate, arms (news articles) appearing and disappearing over time, the inability to access the data in order to cross-validate, and other complexities. This paper does not aim to handle all of these, but only the one which led to a key insight in increased performance, which is the regulation of greed over time. Although there were features available for each time, none of the contestants were able to successfully use the features to substantially boost performance, and the exploration/exploitation aspects turned out to be more important. Here are the main insights leading to large performance gains, all involving regulating greed over time:

• “Peak grabber”: Stop exploration when a good arm appears. Specifically, when the article was clicked 9/100 times, keep showing it and stop exploration altogether until the arm’s click-through rate drops below that of another arm. Since this strategy does not handle the massive global trends we observed in the data, it needed to be modified as follows:

• “Dynamic peak grabber”: Stop exploration when the click-through rate of one arm is at least 15% above the global click-through rate.

• Stop exploring old articles: We can determine approximately how long the arm is likely to stay, and we reduce exploration gradually as the arm gets older.

• Do not fully explore new arms: When a new arm appears, do not use 1 as the upper confidence bound for the probability of click, which would force a UCB algorithm to explore it; use 0.88 instead. This allows the algorithm to continue exploiting the arms that are known to be good rather than exploring new ones.
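The “dynamic peak grabber” rule can be sketched as a simple predicate. This is a hypothetical illustration, not the contest entry's code: the function name and arguments are invented, and “15% above” is assumed to mean a relative margin over the global click-through rate.

```python
def should_stop_exploring(arm_clicks, arm_views, global_clicks, global_views, margin=0.15):
    """Return True when one arm's click-through rate beats the global rate by the margin."""
    if arm_views == 0 or global_views == 0:
        return False  # not enough data to compare rates
    arm_ctr = arm_clicks / arm_views
    global_ctr = global_clicks / global_views
    # Assumption: the 15% threshold is relative to the global click-through rate.
    return arm_ctr >= (1.0 + margin) * global_ctr
```

Because the threshold moves with the global rate, the rule adapts to broad trends in the data, unlike the fixed 9/100 criterion of the basic peak grabber.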

The peak grabber strategies inspired the abstracted setting here, where one can think of a good article appearing during periods of high G(t), where we would want to limit exploration; however, the other strategies are also relevant cases where the exploration/exploitation tradeoff is regulated over time. There were no “lock-up” periods in the contest dataset, though as discussed earlier, the G(t) function is also relevant for modeling that setting. The large global trends we observed in the contest data click-through rates are very relevant to the G(t) model, since one would want to explore less when the click rate is high in order to get more clicks overall.


5 Conclusions 

The dynamic trends we observe in most retail and marketing settings are dramatic. It is possible that understanding these 
dynamics and how to take advantage of them is central to the success of multi-armed bandit algorithms. We showed in this 
work how to adapt regret bound analysis to this setting, where we now need to consider not only how many times an arm 
was pulled in the past, but precisely when the arm was pulled. The key element of our algorithms is that they regulate greed 
(exploitation) over time, where during high reward periods, less exploration is performed. 

There are many possible extensions to this work. In particular, if G(t) is not known in advance, it may be easy to estimate from data in real time, as in the dynamic peak grabber strategy. The analysis of the algorithms in this paper could be extended to other important multi-armed bandit algorithms besides ε-greedy and UCB. Further, future work will consider the connection of mortal bandits (with appearing/disappearing arms) with the G(t) setting, since for mortal bandits, each bandit’s G(t) function can change at a different rate.
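A Regret-bound for ε-greedy algorithm with hard threshold

The regret at round n is given by

\[
R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j G(t) \mathbb{1}_{\{I_t = j\}}, \tag{19}
\]

where G(t) is the greed function evaluated at time t, $\mathbb{1}_{\{I_t = j\}}$ is an indicator function equal to 1 if arm j is played at time t (otherwise its value is 0), and $\Delta_j = \mu^* - \mu_j$ is the difference between the mean of the best arm's reward distribution and the mean of the j-th arm's reward distribution. By considering the threshold z, which determines which rule is applied to decide what arm to play, we can rewrite the regret as

\[
R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j G(t) \mathbb{1}_{\{G(t) < z\}} \mathbb{1}_{\{I_t = j\}} + \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j G(t) \mathbb{1}_{\{G(t) \ge z\}} \mathbb{1}_{\{I_t = j\}}.
\]

By taking the expectation we have that

\[
\mathbb{E}[R_n] = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j G(t) \mathbb{1}_{\{G(t) < z\}} \mathbb{P}(\{I_t = j\}) + \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j G(t) \mathbb{1}_{\{G(t) \ge z\}} \mathbb{P}(\{I_t = j\}),
\]

which can be rewritten as

\[
\mathbb{E}[R_n] = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j G(t) \mathbb{1}_{\{G(t) < z\}} \left( \varepsilon_t \frac{1}{m} + (1 - \varepsilon_t)\, \mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i\big) \right) + \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j G(t) \mathbb{1}_{\{G(t) \ge z\}} \mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i\big). \tag{20}
\]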


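For the rounds of the algorithm where G(t) < z, we are in the standard setting, so for those times, we follow the standard proof of Auer et al. (2002). For the times that are over the threshold, we need to create a separate bound. Let us now bound the probability of playing the sub-optimal arm j at time t when the greed function is above the threshold z.

\begin{align}
\mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i\big) &\le \mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\big) \tag{21} \\
&\le \mathbb{P}\Big(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\Big) + \mathbb{P}\Big(\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\Big), \tag{22}
\end{align}

where the last inequality follows from the fact that

\[
\big\{\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\big\} \subseteq \Big\{\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\Big\} \cup \Big\{\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\Big\}. \tag{23}
\]

In fact, suppose that there exists an element $\omega \in \{\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\}$ that does not belong to $\{\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\} \cup \{\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\}$. Then, we would have that

\begin{align}
\omega &\in \Big( \Big\{\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\Big\} \cup \Big\{\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\Big\} \Big)^{C} \tag{24} \\
&= \Big\{\bar{X}_{*,T_*(t-1)} > \mu^* - \frac{\Delta_j}{2}\Big\} \cap \Big\{\bar{X}_{j,T_j(t-1)} < \mu_j + \frac{\Delta_j}{2}\Big\}, \tag{25}
\end{align}

but from the intersection of events given in (25) it follows that $\bar{X}_{*,T_*(t-1)} > \mu^* - \frac{\Delta_j}{2} = \mu_j + \frac{\Delta_j}{2} > \bar{X}_{j,T_j(t-1)}$, which contradicts $\omega \in \{\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\}$. Therefore, all elements of $\{\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\}$ belong to $\{\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\} \cup \{\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\}$.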

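Let us consider the first term of (22) (the computations for the second term are similar),

\begin{align}
\mathbb{P}\Big(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\Big) &= \sum_{s=1}^{t-1} \mathbb{P}\Big(T_j(t-1) = s,\ \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\Big) \nonumber \\
&= \sum_{s=1}^{t-1} \mathbb{P}\Big(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\Big)\, \mathbb{P}\Big(\bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\Big) \nonumber \\
&\le \sum_{s=1}^{t-1} \mathbb{P}\Big(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\Big)\, e^{-\frac{\Delta_j^2}{2} s}, \tag{26}
\end{align}

where in the last inequality we used the Chernoff-Hoeffding bound. Let us define $T_j^R(t-1)$ as the number of times arm j is played at random (note that $T_j^R(t-1) \le T_j(t-1)$ and that $T_j^R(t-1) = \sum_{s=1}^{\tilde{t}} B_s$, where $B_s$ is a Bernoulli r.v. with parameter $\varepsilon_s/m$), and let us define

\[
x_0 = \frac{1}{2m} \sum_{s=1}^{\tilde{t}} \varepsilon_s,
\]

where $\tilde{t}$ is the number of rounds played under the threshold z up to time t. Then,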


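\begin{align}
(26) &\le \sum_{s=1}^{\lfloor x_0 \rfloor} \mathbb{P}\Big(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\Big) + \sum_{s=\lfloor x_0 \rfloor + 1}^{t-1} e^{-\frac{\Delta_j^2}{2} s} \nonumber \\
&\le \sum_{s=1}^{\lfloor x_0 \rfloor} \mathbb{P}\Big(T_j(t-1) = s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\Big) + \frac{2}{\Delta_j^2} e^{-\frac{\Delta_j^2}{2} \lfloor x_0 \rfloor} \nonumber \\
&\le \sum_{s=1}^{\lfloor x_0 \rfloor} \mathbb{P}\Big(T_j^R(t-1) \le s \,\Big|\, \bar{X}_{j,s} \ge \mu_j + \frac{\Delta_j}{2}\Big) + \frac{2}{\Delta_j^2} e^{-\frac{\Delta_j^2}{2} \lfloor x_0 \rfloor} \nonumber \\
&\le \lfloor x_0 \rfloor\, \mathbb{P}\big(T_j^R(t-1) \le \lfloor x_0 \rfloor\big) + \frac{2}{\Delta_j^2} e^{-\frac{\Delta_j^2}{2} \lfloor x_0 \rfloor}, \tag{27}
\end{align}

where for the first $\lfloor x_0 \rfloor$ terms of the sum we upper bounded $e^{-\frac{\Delta_j^2}{2} s}$ by 1, and for the remaining terms we used the fact that $\sum_{s=x+1}^{\infty} e^{-\kappa s} \le \frac{1}{\kappa} e^{-\kappa x}$, where in our case $\kappa = \frac{\Delta_j^2}{2}$. We have that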


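\[
\mathbb{E}[T_j^R(t-1)] = \frac{1}{m} \sum_{s=1}^{\tilde{t}} \varepsilon_s, \qquad \mathrm{Var}\big(T_j^R(t-1)\big) = \sum_{s=1}^{\tilde{t}} \frac{\varepsilon_s}{m} \Big(1 - \frac{\varepsilon_s}{m}\Big) \le \frac{1}{m} \sum_{s=1}^{\tilde{t}} \varepsilon_s = \mathbb{E}[T_j^R(t-1)],
\]

and, using the Bernstein inequality $\mathbb{P}(S_n \le \mathbb{E}[S_n] - a) \le \exp\Big\{-\frac{a^2/2}{\sigma^2 + a/2}\Big\}$ with $S_n = T_j^R(t-1)$ and $a = \frac{1}{2}\mathbb{E}[T_j^R(t-1)]$,

\begin{align}
\mathbb{P}\big(T_j^R(t-1) \le \lfloor x_0 \rfloor\big) &\le \mathbb{P}\Big(T_j^R(t-1) \le \mathbb{E}[T_j^R(t-1)] - \frac{1}{2}\mathbb{E}[T_j^R(t-1)]\Big) \nonumber \\
&\le \exp\left\{ -\frac{\frac{1}{8}\big(\mathbb{E}[T_j^R(t-1)]\big)^2}{\mathbb{E}[T_j^R(t-1)] + \frac{1}{4}\mathbb{E}[T_j^R(t-1)]} \right\} \nonumber \\
&= \exp\Big\{ -\frac{1}{5} \cdot \frac{1}{2}\mathbb{E}[T_j^R(t-1)] \Big\} = \exp\Big\{ -\frac{1}{5} x_0 \Big\} \le \exp\Big\{ -\frac{1}{5} \lfloor x_0 \rfloor \Big\}. \tag{28}
\end{align}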
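Now we need a lower bound on $\lfloor x_0 \rfloor$. Let us define $n' = \lfloor km \rfloor$, then

\begin{align}
x_0 &= \frac{1}{2m} \sum_{s=1}^{\tilde{t}} \varepsilon_s = \frac{1}{2m} \sum_{s=1}^{\tilde{t}} \min\Big\{1, \frac{km}{s}\Big\} \nonumber \\
&= \frac{1}{2m} \sum_{s=1}^{n'} 1 + \frac{1}{2m} \sum_{s=n'+1}^{\tilde{t}} \frac{km}{s} \nonumber \\
&= \frac{n'}{2m} + \frac{1}{2m} \left( \sum_{s=1}^{\tilde{t}} \frac{km}{s} - \sum_{s=1}^{n'} \frac{km}{s} \right) \nonumber \\
&\ge \frac{n'}{2m} + \frac{k}{2} \Big( \log(\tilde{t} + 1) - \big(\log(n') + \log(e)\big) \Big) \nonumber \\
&\ge \frac{n'}{2m} + \frac{k}{2} \log\Big(\frac{\tilde{t}}{n' e}\Big) \nonumber \\
&\ge \frac{k}{2} \log\Big(\frac{\tilde{t}}{mke}\Big). \tag{29}
\end{align}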
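The lower bound in (29) can be checked numerically. The sketch below assumes, as in the derivation above, that $\varepsilon_s = \min\{1, km/s\}$ and $x_0 = \frac{1}{2m}\sum_{s=1}^{\tilde{t}} \varepsilon_s$; the function names and the parameter values used in the check are hypothetical.

```python
import math


def x0(t_tilde: int, m: int, k: float) -> float:
    """Half the expected number of random plays of one arm after t_tilde rounds."""
    return sum(min(1.0, k * m / s) for s in range(1, t_tilde + 1)) / (2.0 * m)


def x0_lower_bound(t_tilde: int, m: int, k: float) -> float:
    """Right-hand side of (29): (k/2) * log(t_tilde / (m * k * e))."""
    return 0.5 * k * math.log(t_tilde / (m * k * math.e))


def bound_holds(t_tilde: int, m: int, k: float) -> bool:
    """True when the exact sum dominates the logarithmic lower bound."""
    return x0(t_tilde, m, k) >= x0_lower_bound(t_tilde, m, k)
```

For small $\tilde{t}$ the right-hand side is negative, so the bound is only informative once $\tilde{t}$ exceeds $mke$.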


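Remark 1. Note that if $\tilde{t}$ (or t in the usual ε-greedy algorithm) was less than n', then we would have $x_0 = \tilde{t}/2m$, yielding an exponential decay of the bound on the probability of j being the best arm. To see this, $\tilde{t} < n'$ would imply that, using (27) and (28),

\[
\mathbb{P}\Big(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\Big) \le \frac{\tilde{t}}{2m} \exp\Big\{ -\frac{1}{5} \frac{\tilde{t}}{2m} \Big\} + \frac{2}{\Delta_j^2} \exp\Big\{ -\frac{\Delta_j^2}{2} \frac{\tilde{t}}{2m} \Big\}.
\]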


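Continuing the proof of Theorem 3.1, we obtain a bound on the first term in (22) as follows. Using (29) combined with (28) in (27) we get that

\[
\mathbb{P}\Big(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\Big) \le \frac{k}{2} \Big(\frac{\tilde{t}}{mke}\Big)^{-\frac{k}{10}} \log\Big(\frac{\tilde{t}}{mke}\Big) + \frac{2}{\Delta_j^2} \Big(\frac{\tilde{t}}{mke}\Big)^{-\frac{k\Delta_j^2}{4}}. \tag{30}
\]

Since the computations for the second term in (22) are similar, a bound on $\mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i\big)$ is given by

\[
\beta_j(\tilde{t}) = k \Big(\frac{\tilde{t}}{mke}\Big)^{-\frac{k}{10}} \log\Big(\frac{\tilde{t}}{mke}\Big) + \frac{4}{\Delta_j^2} \Big(\frac{\tilde{t}}{mke}\Big)^{-\frac{k\Delta_j^2}{4}}. \tag{31}
\]

We now have an upper bound for $\mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\big)$. We can use this to easily bound $\mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i\big)$ in (20), which yields the following bound on the mean regret at time n:

\[
\mathbb{E}[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j + \sum_{t=m+1}^{n} G(t) \mathbb{1}_{\{G(t) < z\}} \sum_{j=1}^{m} \Delta_j \left( \varepsilon_t \frac{1}{m} + (1 - \varepsilon_t)\, \mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i\big) \right) + \sum_{t=m+1}^{n} G(t) \mathbb{1}_{\{G(t) \ge z\}} \sum_{j=1}^{m} \Delta_j\, \mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i\big).
\]

This, combined with the bound $\beta_j(\tilde{t})$ above, proves the theorem.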


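B Logarithmic bound for Soft-ε-greedy algorithm

At each round t, arm j is played with probability

\[
\frac{\varepsilon_t}{m} + (1 - \varepsilon_t)\, \mathbb{P}\big(\bar{X}_j \ge \bar{X}_i \ \forall i\big),
\]

where $\varepsilon_t = \min\big\{\psi(t), \frac{km}{t}\big\}$ and

\[
\psi(t) = \frac{\log\Big(1 + \frac{1}{G(t)}\Big)}{\log\Big(1 + \frac{1}{\min_{s \in \{m+1,\dots,n\}} G(s)}\Big)}.
\]

Recall that $\gamma = \min_{1 \le t \le n} \psi(t)$.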

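Let us bound the probability $\mathbb{P}(\{I_t = j\})$ of playing the sub-optimal arm j at time t. We have that

\begin{align}
\mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i\big) &\le \mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{*,T_*(t-1)}\big) \tag{32} \\
&\le \mathbb{P}\Big(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\Big) + \mathbb{P}\Big(\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\Big). \tag{33}
\end{align}

For the bound on the two addends in (33), we have identical steps to the proof for Theorem 1, and thus

\[
\mathbb{P}\Big(\bar{X}_{j,T_j(t-1)} \ge \mu_j + \frac{\Delta_j}{2}\Big) \le \lfloor x_0 \rfloor\, \mathbb{P}\big(T_j^R(t-1) \le \lfloor x_0 \rfloor\big) + \frac{2}{\Delta_j^2} e^{-\frac{\Delta_j^2}{2} \lfloor x_0 \rfloor}, \tag{34}
\]
\[
\mathbb{P}\Big(\bar{X}_{*,T_*(t-1)} \le \mu^* - \frac{\Delta_j}{2}\Big) \le \lfloor x_0 \rfloor\, \mathbb{P}\big(T_j^R(t-1) \le \lfloor x_0 \rfloor\big) + \frac{2}{\Delta_j^2} e^{-\frac{\Delta_j^2}{2} \lfloor x_0 \rfloor}, \tag{35}
\]

and we again have

\[
\mathbb{P}\big(T_j^R(t-1) \le \lfloor x_0 \rfloor\big) \le \exp\Big\{ -\frac{1}{5} \lfloor x_0 \rfloor \Big\}. \tag{36}
\]

Now we need a lower bound on $\lfloor x_0 \rfloor$. Let $t \ge w$, where $w = \min\{s \in \{1, \dots, n\} : \frac{km}{s} < \gamma\}$. Then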

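\begin{align}
x_0 &= \frac{1}{2m} \sum_{s=1}^{t} \varepsilon_s = \frac{1}{2m} \sum_{s=1}^{t} \min\Big\{\psi(s), \frac{km}{s}\Big\} \nonumber \\
&\ge \frac{1}{2m} \sum_{s=1}^{w} \gamma + \frac{1}{2m} \left( \sum_{s=1}^{t} \frac{km}{s} - \sum_{s=1}^{w} \frac{km}{s} \right) \nonumber \\
&\ge \frac{\gamma w}{2m} + \frac{k}{2} \Big( \log(t + 1) - \big(\log(w) + \log(e)\big) \Big) \nonumber \\
&\ge \frac{k}{2} \log\Big(\frac{t}{w e}\Big) \nonumber \\
&= \frac{k}{2} \log\Big(\frac{\gamma t}{mke}\Big). \tag{37}
\end{align}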


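Using (37) in (36), combined with (34) and (35), from (33) the bound on $\mathbb{P}\big(\bar{X}_{j,T_j(t-1)} \ge \bar{X}_{i,T_i(t-1)} \ \forall i\big)$ is given by

\[
\beta_j^{S}(t) = k \Big(\frac{\gamma t}{mke}\Big)^{-\frac{k}{10}} \log\Big(\frac{\gamma t}{mke}\Big) + \frac{4}{\Delta_j^2} \Big(\frac{\gamma t}{mke}\Big)^{-\frac{k\Delta_j^2}{4}}.
\]

Since the mean regret is given by

\[
\mathbb{E}[R_n] = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j G(t)\, \mathbb{P}(\{I_t = j\}),
\]

the bound on the mean regret at time n is given by

\[
\mathbb{E}[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j + \sum_{t=m+1}^{n} G(t) \sum_{j=1}^{m} \Delta_j \left( \frac{\varepsilon_t}{m} + (1 - \varepsilon_t)\, \beta_j^{S}(t) \right). \tag{38}
\]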
C The regret bound of the soft UCB algorithm

The regret at round n is given by

\[
R_n = \sum_{t=1}^{n} \sum_{j=1}^{m} \Delta_j G(t) \mathbb{1}_{\{I_t = j\}}.
\]

The expected regret $\mathbb{E}[R_n]$ at round n is bounded by

\[
\mathbb{E}[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j + \max_{t \in \{m+1,\dots,n\}} G(t) \sum_{j=1}^{m} \Delta_j\, \mathbb{E}[T_j(n)], \tag{39}
\]

where $T_j(n) = \sum_{t=1}^{n} \mathbb{1}_{\{I_t = j\}}$ is the number of times the sub-optimal arm j has been chosen up to round n. Recall that

\[
\bar{X}_j = \frac{1}{T_j(t-1)} \sum_{s=1}^{T_j(t-1)} X_{j,s}.
\]


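From the Chernoff-Hoeffding Inequality we have that

\[
\mathbb{P}\left( \frac{1}{T_j(t-1)} \sum_{i=1}^{T_j(t-1)} X_{j,i} - \mu_j \le -\varepsilon \right) \le \exp\{-2 T_j(t-1) \varepsilon^2\}, \tag{40}
\]

and

\[
\mathbb{P}\left( \frac{1}{T_j(t-1)} \sum_{i=1}^{T_j(t-1)} X_{j,i} - \mu_j \ge \varepsilon \right) \le \exp\{-2 T_j(t-1) \varepsilon^2\}. \tag{41}
\]

Let us define the following function:

\[
\xi(t) = 1 + \frac{t}{G(t)};
\]

by selecting $\varepsilon = \sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}}$ we have

\[
\mathbb{P}\left( \bar{X}_j + \sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}} \le \mu_j \right) \le \xi(t)^{-4}, \tag{42}
\]

and

\[
\mathbb{P}\left( \bar{X}_j - \sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}} \ge \mu_j \right) \le \xi(t)^{-4}. \tag{43}
\]

Equivalently, we may write for every j

\[
\mu_j - \sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}} \le \bar{X}_j \quad \text{with probability at least } 1 - \xi(t)^{-4}, \tag{44}
\]
\[
\mu_j + \sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}} \ge \bar{X}_j \quad \text{with probability at least } 1 - \xi(t)^{-4}. \tag{45}
\]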

If we choose arm j at round t (i.e., the event $\{I_t = j\}$ occurs) we have that

\[
\bar{X}_j + \sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}} \ge \bar{X}_* + \sqrt{\frac{2 \log \xi(t)}{T_*(t-1)}}. \tag{46}
\]
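The index compared in (46) can be sketched as a small Python function. This is an illustrative sketch that assumes $\xi(t) = 1 + t/G(t)$ as defined in this appendix; the function names are hypothetical.

```python
import math


def xi(t: int, G_t: float) -> float:
    """Smoothing function xi(t) = 1 + t / G(t); G_t must be strictly positive."""
    return 1.0 + t / G_t


def soft_ucb_index(mean_estimate: float, pulls: int, t: int, G_t: float) -> float:
    """Mean estimate plus the exploration bonus sqrt(2 log xi(t) / T_j(t-1))."""
    return mean_estimate + math.sqrt(2.0 * math.log(xi(t, G_t)) / pulls)
```

For a fixed round and pull count, a larger multiplier G(t) gives a smaller value of $\xi(t)$ and therefore a smaller exploration bonus, so the arm choice is driven more by the current mean estimates during high-reward periods.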
Let us use (45) to upper bound the LHS and (44) to lower bound the RHS of (46); then we get, with probability at least $1 - 2\xi(t)^{-4}$,

\[
\mu_j + 2\sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}} \ge \mu^*,
\]

from which we get, with probability at least $1 - 2\xi(t)^{-4}$,

\[
T_j(t-1) \le \frac{8 \log \xi(t)}{\Delta_j^2}. \tag{47}
\]


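In order to emphasize the dependence of $\bar{X}_j$ on $T_j(t-1)$ we will sometimes write $\bar{X}_{j,T_j(t-1)}$. In the following, notice that in (49) the summation starts from m + 1 because in the first m initialization rounds each arm is played once. Moreover, step (50) follows from (49) by assuming that arm j has already been played u times. By using (46) we get (51); then, for each t,

\[
\left\{ \bar{X}_{j,T_j(t-1)} + \sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}} \ge \bar{X}_{*,T_*(t-1)} + \sqrt{\frac{2 \log \xi(t)}{T_*(t-1)}},\ T_j(t-1) \ge u \right\} \tag{48}
\]

is included in

\[
\left\{ \max_{s_j \in \{u,\dots,T_j(t-1)\}} \left( \bar{X}_{j,s_j} + \sqrt{\frac{2 \log \xi(t)}{s_j}} \right) \ge \min_{s_* \in \{1,\dots,T_*(t-1)\}} \left( \bar{X}_{*,s_*} + \sqrt{\frac{2 \log \xi(t)}{s_*}} \right) \right\},
\]

which justifies (52). We also have that (48) is included in

\[
\bigcup_{s_*=1}^{T_*(t-1)} \bigcup_{s_j=u}^{T_j(t-1)} \left\{ \bar{X}_{j,s_j} + \sqrt{\frac{2 \log \xi(t)}{s_j}} \ge \bar{X}_{*,s_*} + \sqrt{\frac{2 \log \xi(t)}{s_*}} \right\}.
\]

Thus, for any integer u, we may write

\begin{align}
T_j(n) &= 1 + \sum_{t=m+1}^{n} \mathbb{1}_{\{I_t = j\}} \tag{49} \\
&\le u + \sum_{t=m+1}^{n} \mathbb{1}_{\{I_t = j,\ T_j(t-1) \ge u\}} \tag{50} \\
&\le u + \sum_{t=m+1}^{n} \mathbb{1}\left\{ \bar{X}_{j,T_j(t-1)} + \sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}} \ge \bar{X}_{*,T_*(t-1)} + \sqrt{\frac{2 \log \xi(t)}{T_*(t-1)}},\ T_j(t-1) \ge u \right\} \tag{51} \\
&\le u + \sum_{t=m+1}^{n} \mathbb{1}\left\{ \max_{s_j \in \{u,\dots,T_j(t-1)\}} \left( \bar{X}_{j,s_j} + \sqrt{\frac{2 \log \xi(t)}{s_j}} \right) \ge \min_{s_* \in \{1,\dots,T_*(t-1)\}} \left( \bar{X}_{*,s_*} + \sqrt{\frac{2 \log \xi(t)}{s_*}} \right) \right\} \tag{52} \\
&\le u + \sum_{t=m+1}^{n} \sum_{s_*=1}^{T_*(t-1)} \sum_{s_j=u}^{T_j(t-1)} \mathbb{1}\left\{ \bar{X}_{j,s_j} + \sqrt{\frac{2 \log \xi(t)}{s_j}} \ge \bar{X}_{*,s_*} + \sqrt{\frac{2 \log \xi(t)}{s_*}} \right\}. \tag{53}
\end{align}

When

\[
\mathbb{1}\left\{ \bar{X}_{j,s_j} + \sqrt{\frac{2 \log \xi(t)}{s_j}} \ge \bar{X}_{*,s_*} + \sqrt{\frac{2 \log \xi(t)}{s_*}} \right\} \tag{54}
\]

is equal to one, at least one of the following has to be true:

\begin{align}
\bar{X}_{*,s_*} &\le \mu^* - \sqrt{\frac{2 \log \xi(t)}{s_*}}, \tag{55} \\
\bar{X}_{j,s_j} &\ge \mu_j + \sqrt{\frac{2 \log \xi(t)}{s_j}}, \tag{56} \\
\mu^* &< \mu_j + 2\sqrt{\frac{2 \log \xi(t)}{s_j}}. \tag{57}
\end{align}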

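(In fact, suppose none of them hold simultaneously. Then from (55) we would have that $\bar{X}_{*,s_*} > \mu^* - \sqrt{\frac{2 \log \xi(t)}{s_*}}$; applying (57) (with the inequality reversed, since we are assuming it does not hold) we get $\bar{X}_{*,s_*} > \mu_j + 2\sqrt{\frac{2 \log \xi(t)}{s_j}} - \sqrt{\frac{2 \log \xi(t)}{s_*}}$, and then from (56) (again, with the inequality reversed) it follows that $\bar{X}_{*,s_*} > \bar{X}_{j,s_j} + \sqrt{\frac{2 \log \xi(t)}{s_j}} - \sqrt{\frac{2 \log \xi(t)}{s_*}}$, which is in contradiction with (54).) Now, if we set $u = \left\lceil \frac{8}{\Delta_j^2} \log\Big( \max_{t \in \{m+1,\dots,n\}} \xi(t) \Big) \right\rceil$, for $T_j(t-1) \ge u$,

\[
2\sqrt{\frac{2 \log \xi(t)}{T_j(t-1)}} \le 2\sqrt{\frac{2 \log \xi(t)}{u}} \le \Delta_j \sqrt{\frac{\log \xi(t)}{\log\Big( \max_{t \in \{m+1,\dots,n\}} \xi(t) \Big)}} \le \mu^* - \mu_j,
\]

therefore, with this choice of u, (57) cannot hold.
Thus, using (53), we have that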


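\[
T_j(n) \le \left\lceil \frac{8}{\Delta_j^2} \log\Big( \max_{t \in \{m+1,\dots,n\}} \xi(t) \Big) \right\rceil + \sum_{t=m+1}^{n} \sum_{s_*=1}^{T_*(t-1)} \sum_{s_j=u}^{T_j(t-1)} \mathbb{1}\left\{ \bar{X}_{*,s_*} \le \mu^* - \sqrt{\frac{2 \log \xi(t)}{s_*}} \right\} + \sum_{t=m+1}^{n} \sum_{s_*=1}^{T_*(t-1)} \sum_{s_j=u}^{T_j(t-1)} \mathbb{1}\left\{ \bar{X}_{j,s_j} \ge \mu_j + \sqrt{\frac{2 \log \xi(t)}{s_j}} \right\},
\]

and by taking expectation,

\begin{align}
\mathbb{E}[T_j(n)] &\le \left\lceil \frac{8}{\Delta_j^2} \log\Big( \max_{t \in \{m+1,\dots,n\}} \xi(t) \Big) \right\rceil + \sum_{t=m+1}^{n} \sum_{s_*=1}^{T_*(t-1)} \sum_{s_j=u}^{T_j(t-1)} \mathbb{P}\left( \bar{X}_{*,s_*} \le \mu^* - \sqrt{\frac{2 \log \xi(t)}{s_*}} \right) + \sum_{t=m+1}^{n} \sum_{s_*=1}^{T_*(t-1)} \sum_{s_j=u}^{T_j(t-1)} \mathbb{P}\left( \bar{X}_{j,s_j} \ge \mu_j + \sqrt{\frac{2 \log \xi(t)}{s_j}} \right) \nonumber \\
&\le \frac{8}{\Delta_j^2} \log\Big( \max_{t \in \{m+1,\dots,n\}} \xi(t) \Big) + 1 + 2 \sum_{t=m+1}^{n} \xi(t)^{-4} (t-1-m)^2, \nonumber
\end{align}

where in the last step we upper bound $T_*(t-1)$ and $T_j(t-1)$ by $(t-1-m)$ (cases where we have only played the best arm or arm j). Therefore, by using (39),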


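\[
\mathbb{E}[R_n] \le \sum_{j=1}^{m} G(j)\Delta_j + \max_{t \in \{m+1,\dots,n\}} G(t) \left( \sum_{j:\mu_j<\mu^*} \frac{8}{\Delta_j} \log\Big( \max_{t \in \{m+1,\dots,n\}} \xi(t) \Big) + \sum_{j=1}^{m} \Delta_j \Big[ 1 + 2 \sum_{t=m+1}^{n} \xi(t)^{-4} (t-1-m)^2 \Big] \right).
\]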
D Notation summary


m: number of arms;
n: number of rounds;

$G : \{1, \dots, n\} \to \mathbb{R}^{+}$: known multiplier function;

$X_j(t)$: unscaled random reward for playing arm j;

$X_j(t)G(t)$: actual reward;

$\mu^*$: mean reward of the optimal arm ($\mu^* = \max_{1 \le j \le m} \mu_j$);

$\Delta_j$: difference between the mean reward of the optimal arm and the mean reward of arm j ($\Delta_j = \mu^* - \mu_j$);
$\bar{X}_j$: current estimate of $\mu_j$;

$I_t$: arm played at turn t;

$T_j(t-1)$: number of times arm j has been played before round t starts;

z: threshold (used in the hard-threshold algorithms);
$\tilde{t}$: number of rounds under the threshold z up to time t (in the hard-threshold algorithms);


k: a constant greater than 10 such that k > . ^ > 

c: a constant greater than 10 in Algorithm 6;

d: a constant such that $d < \min_j \Delta_j$ and 0 < d < 1 in Algorithm 6;

et'- probability of exploration at turn t (used in Algorithmic and Algorithmic; 

Pj (t): upper bound on the probability of considering suboptimal arm j being the best arm at round t when using Algorithm 

0 

Pf'^ (t): upper bound on the probability of considering suboptimal arm j being the best arm at round t when using Algorithm 

E 

Pj {t): upper bound on the probability of considering suboptimal arm j being the best arm at round t when using Algorithm 

i 


$\psi(t)$: smoothing function used to define the probabilities of exploration $\varepsilon_t$ in the soft-ε-greedy algorithm (see the corresponding figure);

$\gamma$: lowest value of $\psi$ ($\gamma = \min_{s \in \{m+1,\dots,n\}} \psi(s)$);

n': particular time defined as km in the comparison between the ε-greedy algorithms in Section 3.3;

w: first round when $\frac{km}{s}$ is less than $\gamma$ ($w = \arg\min f(s)$, subject to $f(s) < \gamma$, where $f(s) = \frac{km}{s}$) in the comparison between the ε-greedy algorithms;

$\mathcal{B}$: set of rounds when the “high reward” zone is entered in the hard-threshold algorithm ($\mathcal{B} = \{t : G(t-1) < z,\ G(t) \ge z\}$);

$Y_k = \{t : t \ge y_k,\ G(t) \ge z,\ t < y_{k+1}\}$: set of rounds in the high-reward period entered at time $y_k$ ($k \in \{1, \dots, |\mathcal{B}|\}$) in the hard-threshold algorithm;

$\Lambda_k = \max_{t \in Y_k} G(t)$: highest value of G(t) on $Y_k$ in the hard-threshold algorithm;
$R_k = \Lambda_k |Y_k|$: the maximum regret of the k-th high reward zone in the hard-threshold algorithm;
$\xi(t)$: smoothing function used to define the decision rule in the soft-UCB algorithm;
e: Euler’s number;

$R_n$: total regret at round n.
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